Abstract

This is an attempted partial reproduction of J. Chakraborty’s 2021 study ‘Social inequities in the distribution of COVID-19: An intra-categorical analysis of people with disabilities in the U.S.’. In it, I’ll reproduce Chakraborty’s workflow for extracting the COVID-19 incidence rate for different disability demographic groups, assessing the correlation between county-level case incidence and the population share of these demographics. This section of the study uses COVID-19 case data from August 1st, 2020. Because this was so early in the pandemic, spread across space plays a substantial role in case incidence. To account for this, Chakraborty delineates high-incidence clusters as a spatial variable input to Generalized Estimating Equations (GEEs).

Instead of running GEEs, I’ll deviate from Chakraborty’s work by clustering county-level cases using the Kulldorff spatial scan statistic to generate case clusters and assess the spatial scale of COVID-19 spread. This will also serve as a test of the comparability of the SpatialEpi and SaTScan implementations of this test, a potentially useful tool for open-source reproductions of epidemiological studies that use Kulldorff statistics.

Chakraborty, J. 2021. Social inequities in the distribution of COVID-19: An intra-categorical analysis of people with disabilities in the U.S. Disability and Health Journal 14:1-5. DOI:[10.1016/j.dhjo.2020.101007](https://doi.org/10.1016/j.dhjo.2020.101007)

Study metadata

Original study spatio-temporal metadata

  • Spatial Coverage: The 48 contiguous states of the U.S. plus Washington, D.C.
  • Spatial Resolution: Counties
  • Spatial Reference System: N/A
  • Temporal Coverage: 1/22/20 - 8/1/2020
  • Temporal Resolution: Cumulative recorded COVID-19 cases from the beginning of the pandemic through August 1st, 2020. The disability and sociodemographic data come from the U.S. Census American Community Survey (ACS) five-year estimates for 2014-2018.

Study design

This work is a partial reproduction study assessing the feasibility of reproducing geographic research. Because the original study likely cannot be replicated perfectly from the available manuscript alone, planned deviations will change the end result of the study. I have two goals for this study:

  1. To assess the correlation between disabled populations and COVID-19 incidence rate in the initial pandemic surge, mirroring Chakraborty’s methods to ensure validity.
  2. To assess the clustering of case incidence between counties, with particular intent to compare the SpatialEpi and SaTScan implementations of the clustering method.

A successful study result will reach similar findings to Chakraborty’s regarding the correlation between disability and incidence rate, and will create county clusters of high COVID-19 incidence using the Kulldorff spatial scan to compare to the original SaTScan output.

The original study is an observational study that seeks to answer the question “Is COVID-19 incidence higher in counties that have a higher percentage of disabled residents?”. This research question is ultimately broken into five specific hypotheses, investigating whether people with disabilities are affected differently depending on their race, ethnicity, poverty level, age, and biological sex.

The original study was conducted in ArcGIS and SaTScan software; in this reproduction I’ll attempt to match methods using R instead.

Materials and procedure

Computational environment

Data and variables

All variables in this study were derived from secondary data; there are no experimentally manipulated variables. There are eighteen independent variables: the percentage of total disabled persons per county, and seventeen ‘disaggregated’ categories that break down socio-demographic characteristics of the disabled population. The COVID-19 incidence rate is the dependent variable.

US CENSUS American Community Survey

  • Title: County Level Poverty and Disability Data, Derived from 2014-2018 American Community Survey 5-year Estimates
  • Abstract: This dataset documents poverty and disability data at the county level. The US Census Bureau collects the data in order to produce information on social, economic, housing, and demographic characteristics of the nation’s population every year, providing an important tool for communities to use to see how they are changing.
  • Spatial Coverage: United States, OSM link
  • Spatial Resolution: Counties
  • Spatial Representation Type: Tabular data keyed by county FIPS code (non-spatial until joined to county boundary polygons)
  • Spatial Reference System: N/A (tabular data)
  • Temporal Coverage: 2014-2018
  • Lineage: We computed the summary statistics, including the min, max, mean, and standard deviation. We processed the data after retrieving it by excluding states that are irrelevant to our study area. We also checked for data duplication and omission: we found one county with missing disability and poverty data, for which we replaced the missing values with zeros.
  • Distribution: The data is available for download from the US Census.
  • Constraints: The US Census Data is open access in the public domain.

The variables in question are as follows, with the associated description from the ACS 5 year estimate data which is publicly available online:

| Variable Name in Study | ACS Variable Name |
|---|---|
| percent of total civilian non-institutionalized population with a disability | S1810_C03_001E |
| **Race** | |
| percent w disability: White alone | S1810_C03_004E |
| percent w disability: Black alone | S1810_C03_005E |
| percent w disability: Native American | S1810_C03_006E |
| percent w disability: Asian alone | S1810_C03_007E |
| percent w disability: Other race | S1810_C03_009E |
| **Ethnicity** | |
| percent w disability: Non-Hispanic White | S1810_C03_011E |
| percent w disability: Hispanic | S1810_C03_012E |
| percent w disability: Non-Hispanic non-White | (S1810_C02_001E - S1810_C02_011E - S1810_C02_012E) / (S1810_C01_001E - S1810_C01_011E - S1810_C01_012E) * 100 |
| percent w disability: Other race | S1810_C03_009E |
| **Poverty** | |
| percent w disability: Below poverty level | (C18130_004E + C18130_011E + C18130_018E) / C18130_001E * 100 |
| percent w disability: Above poverty level | (C18130_005E + C18130_012E + C18130_019E) / C18130_001E * 100 |
| **Age** | |
| percent w disability: 5-17 | S1810_C03_014E |
| percent w disability: 18-34 | S1810_C03_015E |
| percent w disability: 35-64 | S1810_C03_016E |
| percent w disability: 65-74 | S1810_C03_017E |
| percent w disability: 75+ | S1810_C03_018E |
| **Biological sex** | |
| percent w disability: male | S1810_C03_002E |
| percent w disability: female | S1810_C03_003E |

JHU Covid Case Dashboard

  • Title: County Level COVID-19 Incidence Rate
  • Abstract: This is the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE)
  • Spatial Coverage: United States, OSM link
  • Spatial Resolution: Counties
  • Spatial Representation Type: Vector Polygon
  • Spatial Reference System: WGS 84 / Pseudo-Mercator (EPSG:3857)
  • Temporal Coverage: 1/22/2020 - 8/1/2020
  • Temporal Resolution: N/A
  • Lineage: The specific dataset used was provided by Chakraborty and was preprocessed in RStudio for analysis. There is no readily apparent way to access archived county-level data from the Johns Hopkins University Center for Systems Science and Engineering database. Archived data versions can be found at the Johns Hopkins CSSE COVID-19 Data Repository (https://github.com/CSSEGISandData/COVID-19); however, the archive only provides summaries at the national scale.

We selected the columns relevant to our analysis to simplify the stored data, retaining POP_ESTIMA and Confirmed. These columns were renamed to pop and cases respectively. We then calculated the COVID-19 rate by dividing cases by the total population. Finally, the geometry of the dataset was dropped.

  • Distribution: The data is publicly available from the JHU website.
  • Constraints: The data is open access under the Creative Commons Attribution 4.0 International (CC BY 4.0) license by Johns Hopkins University on behalf of its Center for Systems Science and Engineering. Copyright Johns Hopkins University 2020.

## Reading layer `covidcase080120' from data source 
##   `/Users/lucas/Documents/GitHub/Covid_19_Clustering_Original/data/raw/public/covidcase080120.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 3108 features and 87 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -13885850 ymin: 2817034 xmax: -7452848 ymax: 6340332
## Projected CRS: WGS 84 / Pseudo-Mercator

Prior observations

At the time of this study pre-registration, the authors had prior knowledge of the geography of the study region with regards to the geographic, population, and historical disease spread phenomena to be studied.


Bias and threats to validity

Extensive knowledge of the physical and social geography of the United States might mean that some conclusions drawn will be influenced by prior knowledge of movement patterns, etc. Additionally, lived experience throughout the course of the pandemic may bias interpretation of patterns in the data.

On the data side, reporting and testing of cases was notoriously inconsistent across municipalities, states, and regions. This means that some counties’ counts are likely incorrect, and that inconsistencies are often correlated (e.g., a whole state de-emphasizing testing, resulting in low case counts).

Data transformations

ACS

I will first work with data from the ACS, specifically table S1810 for disability data and table C18130 for poverty data. Using mutate, I’ll create the percentage of population with disabilities by dividing the number of reported disabilities by the total population for which a disability designation exists.

Note that this is specifically NOT the total population: the total number of disability vs. no-disability answers is less than the total population of counties.

# Join poverty data to disability data
acs <- acs %>% left_join(acs_pov, by = "GEO_ID")

# calculate percentages
acs_derived <- mutate(acs,
  dis_pct = S1810_C02_001E / S1810_C01_001E * 100,
  white_pct = S1810_C02_004E / S1810_C01_001E * 100,
  black_pct = S1810_C02_005E / S1810_C01_001E * 100,
  native_pct = S1810_C02_006E / S1810_C01_001E * 100,
  asian_pct = S1810_C02_007E / S1810_C01_001E * 100,
  other_pct =
    (S1810_C02_008E + S1810_C02_009E + S1810_C02_010E) / S1810_C01_001E * 100,
  non_hisp_white_pct = S1810_C02_011E / S1810_C01_001E * 100,
  hisp_pct = S1810_C02_012E / S1810_C01_001E * 100,
  non_hisp_non_white_pct =
    (S1810_C02_001E - S1810_C02_012E - S1810_C02_011E) / S1810_C01_001E * 100,
  bpov_pct = (C18130_004E + C18130_011E + C18130_018E) / C18130_001E * 100,
  apov_pct = (C18130_005E + C18130_012E + C18130_019E) / C18130_001E * 100,
  pct_5_17 = S1810_C02_014E / S1810_C01_001E * 100,
  pct_18_34 = S1810_C02_015E / S1810_C01_001E * 100,
  pct_35_64 = S1810_C02_016E / S1810_C01_001E * 100,
  pct_65_74 = S1810_C02_017E / S1810_C01_001E * 100,
  pct_75 = S1810_C02_018E / S1810_C01_001E * 100,
  male_pct = S1810_C02_002E / S1810_C01_001E * 100,
  female_pct = S1810_C02_003E / S1810_C01_001E * 100
)

# select only relevant geographic identifiers and derived percentages. Subsetting strings to select only the last 5 digits, which is the FIPS code format used by the JHU data
acs_derived <- acs_derived %>%
  mutate(fips = substr(GEO_ID, nchar(GEO_ID) - 4, nchar(GEO_ID))) %>%
  select(
    fips,
    statefp = STATE.x,
    county = NAME.x,
    county_st = NAME.x,
    contains("pct")
  )

JHU Covid Data

Next, I’ll calculate the case rate per 100,000 people and split the JHU COVID data, with the rate attached, into two products: one with geometry and one without. Having geometry associated with the data breaks things a few steps down the line, so it’s easier to remove it in one of the versions now. I’ll then join the JHU COVID case rate data to the ACS data.

covid_mapping <- covid |>
  mutate(covid_rate = round(cases / pop * 100000, 2))

covid_table <- covid_mapping|>
  st_drop_geometry()

Here’s the resulting map, which matches Chakraborty’s Figure 1. Below, I also map disability rates by county, something Chakraborty didn’t do but which helps visualize spatial patterns across counties.

Now I’ll join the JHU dataset to the ACS data by county-level FIPS code, a unique numerical identifier for each county.

# Join COVID incidence rate data to acs data
acs_covid <- covid_mapping %>%
  left_join(acs_derived, by = "fips")

# move covid_rate column prior to disability percentages
acs_covid <- acs_covid %>%
  select(fips, statefp, county, county_st, covid_rate, everything())
tm_disability_rates <- tm_shape(acs_covid) +
  tm_polygons("dis_pct",
    title = "Percent of People with Disability\n(ACS 2014-2018)",
    style = "quantile",
    border.alpha = .2,
    lwd = 0.2,
    palette = "Oranges"
  )+
  tm_shape(state) +
  tm_borders("grey", lwd = .5) +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  )

tm_disability_rates

It’s hard to pick out a discernible correlation/non-correlation from these two maps, so let’s dive into the data.

Analysis

Statistical Correlation of Disability Rates and Covid Incidence

Deviation from plan: I will also calculate descriptive statistics for each disability subgroup to match the original study’s methodology and results.

Planned deviation for reanalysis: I also calculated the Shapiro-Wilk test for normality, printed in the table with its corresponding p value.

acs_covid_stats <- acs_covid %>%
  st_drop_geometry() %>%
  select(covid_rate, contains("pct")) %>%
  stat.desc(norm = TRUE) %>%
  round(4) %>%
  t() %>%
  as.data.frame() %>%
  select(min, max, mean, SD = std.dev, ShapiroWilk = normtest.W, p = normtest.p)


acs_covid_stats %>%
  kable(caption = "Reproduced Descriptive Statistics",
        align = "c") %>%
  column_spec(2:6, width_min = "5em") %>%
  column_spec(7, width_min = "2em") %>%
  kable_styling(full_width = FALSE)
Reproduced Descriptive Statistics

| Variable | min | max | mean | SD | Shapiro-Wilk W | p |
|---|---|---|---|---|---|---|
| covid_rate | 0.0000 | 14257.1700 | 966.9050 | 1003.9563 | 0.7448 | 0 |
| dis_pct | 3.8261 | 33.7058 | 15.9540 | 4.4028 | 0.9822 | 0 |
| white_pct | 0.8522 | 33.2588 | 13.5496 | 4.6308 | 0.9840 | 0 |
| black_pct | 0.0000 | 20.7024 | 1.4804 | 2.6563 | 0.6057 | 0 |
| native_pct | 0.0000 | 13.7360 | 0.2844 | 0.9442 | 0.2759 | 0 |
| asian_pct | 0.0000 | 3.4531 | 0.0905 | 0.1777 | 0.5053 | 0 |
| other_pct | 0.0000 | 15.2363 | 0.5490 | 0.6525 | 0.5730 | 0 |
| non_hisp_white_pct | 0.1026 | 33.1587 | 12.8358 | 4.8086 | 0.9870 | 0 |
| hisp_pct | 0.0000 | 25.2618 | 0.9855 | 2.1462 | 0.4173 | 0 |
| non_hisp_non_white_pct | 0.0000 | 20.9294 | 2.1327 | 2.7492 | 0.6991 | 0 |
| bpov_pct | 0.0000 | 14.9710 | 3.5689 | 1.8499 | 0.9274 | 0 |
| apov_pct | 3.6526 | 27.2984 | 12.4857 | 3.0542 | 0.9874 | 0 |
| pct_5_17 | 0.0000 | 5.0828 | 1.0327 | 0.4797 | 0.9527 | 0 |
| pct_18_34 | 0.0000 | 5.5864 | 1.5646 | 0.6718 | 0.9568 | 0 |
| pct_35_64 | 1.0121 | 18.3639 | 6.3466 | 2.3024 | 0.9588 | 0 |
| pct_65_74 | 0.0000 | 12.7298 | 3.0912 | 1.1556 | 0.9488 | 0 |
| pct_75 | 0.0000 | 11.1278 | 3.8663 | 1.1916 | 0.9652 | 0 |
| male_pct | 1.2999 | 18.1942 | 8.0579 | 2.3720 | 0.9764 | 0 |
| female_pct | 1.9075 | 19.9434 | 7.8961 | 2.2620 | 0.9815 | 0 |

Results from the Shapiro-Wilk test indicate strong evidence against normality for the majority of the variables used, which motivates confirming the parametric correlation results with a non-parametric test as well. To ensure consistency with Chakraborty’s descriptive statistics, I’ll subtract his summary stats table from mine, with zeros across the board signifying a successful reproduction. table1 is Chakraborty’s Table 1, reconstructed from his published paper.

# load original table 1 results
table1 <- read.csv(here("data", "raw", "public", "chakraborty", "table1.csv"))

# subtract original results from reproduced results
(select(acs_covid_stats, min, max, mean, SD) -
  select(table1, min, max, mean, SD)) %>%
  kable(caption = "Descriptive Statistics Comparison",
        align = "c") %>%
  column_spec(2:5, width = "4em") %>%
  kable_styling(full_width = FALSE)
Descriptive Statistics Comparison

| Variable | min | max | mean | SD |
|---|---|---|---|---|
| covid_rate | 0.0000 | 0.1700 | -0.0950 | -0.0437 |
| dis_pct | -0.0039 | -0.0042 | 0.0040 | 0.0028 |
| white_pct | 0.0022 | -0.0012 | -0.0004 | 0.0008 |
| black_pct | 0.0000 | 0.0024 | 0.0004 | -0.0037 |
| native_pct | 0.0000 | -0.0040 | 0.0044 | 0.0042 |
| asian_pct | 0.0000 | 0.0031 | 0.0005 | -0.0023 |
| other_pct | 0.0000 | -0.0037 | -0.0010 | 0.0025 |
| non_hisp_white_pct | 0.0026 | -0.0013 | -0.0042 | -0.0014 |
| hisp_pct | 0.0000 | 0.0018 | -0.0045 | -0.0038 |
| non_hisp_non_white_pct | 0.0000 | -0.0006 | 0.0027 | -0.0008 |
| bpov_pct | 0.0000 | 0.0010 | -0.0011 | -0.0001 |
| apov_pct | 3.6526 | -0.0016 | 0.0057 | -0.0058 |
| pct_5_17 | 0.0000 | 0.0028 | 0.0027 | -0.0003 |
| pct_18_34 | 0.0000 | -0.0036 | 0.0046 | 0.0018 |
| pct_35_64 | 0.0021 | 0.0039 | -0.0034 | 0.0024 |
| pct_65_74 | 0.0000 | -0.0002 | 0.0012 | -0.0044 |
| pct_75 | 0.0000 | -0.0022 | -0.0037 | 0.0016 |
| male_pct | -0.0001 | 0.0042 | -0.0021 | 0.0020 |
| female_pct | -0.0025 | 0.0034 | -0.0039 | 0.0020 |

These columns are JUST off: it looks like Chakraborty may have rounded COVID-19 cases to the whole person, which means that our reproduction is mostly correct but will be slightly skewed. apov_pct is also suspect, with a higher minimum value in my stats that doesn’t match table1’s results.

carrying on…

I will use cor() to determine the correlation between the COVID-19 incidence rate and each disability variable using Pearson’s correlation coefficient, running it for each disability subgroup. P values will then be calculated for each correlation using the t-distribution, assigning them to their respective columns with mutate.
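The t statistic and (one-tailed) p value computed in the next chunk follow the standard significance test for a correlation coefficient with $n - 2$ degrees of freedom, where $n$ is the number of counties with non-missing disability data:

$$
t = \frac{|r|}{\sqrt{(1 - r^2)/(n - 2)}}, \qquad p = P\left(T_{n-2} > t\right)
$$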

df <- sum(!is.na(acs_covid$dis_pct)) - 2

pearsons_r <- acs_covid %>%
  select(where(is.numeric)) %>%
  st_drop_geometry() %>%
  cor(method = "pearson", use = "everything") %>%
  as.data.frame() %>%
  select(r = covid_rate) %>%
  mutate(
    t = abs(r) / sqrt((1 - r^2) / (df)),
    p = pt(t, df, lower.tail = FALSE)
  ) %>%
  round(3) %>%
  rownames_to_column("variable") %>%
  dplyr::filter(variable != "covid_rate")

pearsons_r %>%
  kable(caption = "Reproduced Pearson's R",
        align = "c") %>%
  column_spec(2:4, width = "4em") %>%
  kable_styling(full_width = FALSE)
Reproduced Pearson’s R

| variable | r | t | p |
|---|---|---|---|
| pop | 0.128 | 7.215 | 0.000 |
| cases | 0.209 | 11.891 | 0.000 |
| x | 0.099 | 5.540 | 0.000 |
| y | -0.412 | 25.195 | 0.000 |
| dis_pct | -0.060 | 3.350 | 0.000 |
| white_pct | -0.332 | 19.612 | 0.000 |
| black_pct | 0.460 | 28.847 | 0.000 |
| native_pct | 0.019 | 1.072 | 0.142 |
| asian_pct | 0.094 | 5.272 | 0.000 |
| other_pct | 0.026 | 1.460 | 0.072 |
| non_hisp_white_pct | -0.361 | 21.545 | 0.000 |
| hisp_pct | 0.119 | 6.686 | 0.000 |
| non_hisp_non_white_pct | 0.442 | 27.429 | 0.000 |
| bpov_pct | NA | NA | NA |
| apov_pct | NA | NA | NA |
| pct_5_17 | 0.084 | 4.688 | 0.000 |
| pct_18_34 | 0.063 | 3.493 | 0.000 |
| pct_35_64 | -0.008 | 0.460 | 0.323 |
| pct_65_74 | -0.091 | 5.113 | 0.000 |
| pct_75 | -0.186 | 10.541 | 0.000 |
| male_pct | -0.134 | 7.519 | 0.000 |
| female_pct | 0.023 | 1.305 | 0.096 |

This looks similar to Chakraborty’s final Pearson’s table; to compare, I will find the difference between the two tables’ coefficients and between the significance levels assigned to each variable.

# calculate number of significance stars at p < 0.01 and p < 0.05 levels.
pearsons_r <- mutate(pearsons_r, rp_stars = as.numeric(as.character(cut(p,
  breaks = c(-0.1, 0.01, 0.05, 1),
  labels = c(2, 1, 0)
))))

# join reproduction coefficients to original study coefficients
correlations <- table1 %>%
  dplyr::filter(variable != "covid_rate") %>%
  dplyr::select(variable, or_r = r, or_stars = stars) %>%
  left_join(select(pearsons_r, variable, rp_r = r, rp_stars), by = "variable")

# find difference between coefficient and stars
correlations <- correlations %>%
  bind_cols(rename_with(
    correlations[, 4:5] - correlations[, 2:3],
    ~ paste0(.x, "_diff")
  ))

# find coefficients with different directions
correlations <- correlations %>% mutate(rp_dir_diff = (rp_r > 0) - (or_r > 0))

correlations %>%
  kable(caption = "Compare reproduced and original Pearson's R",
        col.names = c("Variable", "R", "Sig. Level", "R", "Sig. Level", "R", "Sig. Level", "Direction"),
        align = "c") %>%
  kable_styling() %>%
  add_header_above(c(" " = 1, "Original" = 2, "Reproduced" = 2, "Difference" = 3))
Compare reproduced and original Pearson’s R

| Variable | Original R | Original Sig. | Reproduced R | Reproduced Sig. | R Diff. | Sig. Diff. | Direction Diff. |
|---|---|---|---|---|---|---|---|
| dis_pct | -0.056 | 2 | -0.060 | 2 | -0.004 | 0 | 0 |
| white_pct | -0.326 | 2 | -0.332 | 2 | -0.006 | 0 | 0 |
| black_pct | 0.456 | 2 | 0.460 | 2 | 0.004 | 0 | 0 |
| native_pct | 0.020 | 0 | 0.019 | 0 | -0.001 | 0 | 0 |
| asian_pct | 0.097 | 2 | 0.094 | 2 | -0.003 | 0 | 0 |
| other_pct | 0.028 | 0 | 0.026 | 0 | -0.002 | 0 | 0 |
| non_hisp_white_pct | -0.355 | 2 | -0.361 | 2 | -0.006 | 0 | 0 |
| hisp_pct | 0.119 | 2 | 0.119 | 2 | 0.000 | 0 | 0 |
| non_hisp_non_white_pct | 0.439 | 2 | 0.442 | 2 | 0.003 | 0 | 0 |
| bpov_pct | 0.108 | 2 | NA | NA | NA | NA | NA |
| apov_pct | -0.146 | 2 | NA | NA | NA | NA | NA |
| pct_5_17 | 0.083 | 2 | 0.084 | 2 | 0.001 | 0 | 0 |
| pct_18_34 | 0.066 | 1 | 0.063 | 2 | -0.003 | 1 | 0 |
| pct_35_64 | -0.005 | 0 | -0.008 | 0 | -0.003 | 0 | 0 |
| pct_65_74 | -0.089 | 2 | -0.091 | 2 | -0.002 | 0 | 0 |
| pct_75 | -0.181 | 2 | -0.186 | 2 | -0.005 | 0 | 0 |
| male_pct | -0.131 | 2 | -0.134 | 2 | -0.003 | 0 | 0 |
| female_pct | 0.028 | 0 | 0.023 | 0 | -0.005 | 0 | 0 |

This correlation and significance table is the final product of my reproduction of the initial phase of Chakraborty’s disability correlation work. The final table should match the corresponding table in his manuscript, which it does … mostly. All of the correlation coefficients are slightly off from Chakraborty’s original calculations, although the significance levels are all the same with the exception of pct_18_34, which went up in significance when I ran the test. I’m not sure of the source of this, but having checked through the workflow thoroughly, it could be attributable to the initial case rounding discrepancy noted above.

Deviation from published plan

Chakraborty didn’t test the assumption of normality required for a Pearson’s correlation test (and the variables are non-normally distributed, as seen in my Shapiro-Wilk test). To adjust for this, I’ll conduct a bivariate non-parametric correlation analysis (Spearman’s rank correlation) to check the validity of his results.

df <- sum(!is.na(acs_covid$dis_pct)) - 2

spearmans_rho <- acs_covid %>%
  select(where(is.numeric)) %>%
  st_drop_geometry() %>%
  cor(method = "spearman", use = "everything") %>%
  as.data.frame() %>%
  select(rho = covid_rate) %>%
  mutate(
    t = abs(rho) / sqrt((1 - rho^2) / (df)),
    p = pt(t, df, lower.tail = FALSE)
  ) %>%
  round(3) %>%
  rownames_to_column("variable") %>%
  dplyr::filter(variable != "covid_rate")

Compare the Spearman’s rho correlation coefficients to the reproduced Pearson’s r correlation coefficients. Differences are calculated as Spearman’s Rho - Pearson’s R.

# calculate number of significance stars at p<0.01 and P<0.05 levels.
spearmans_rho <- mutate(spearmans_rho, rp_rho_stars = as.numeric(as.character(cut(p,
  breaks = c(-0.1, 0.01, 0.05, 1),
  labels = c(2, 1, 0)
))))

correlations <- correlations[, 1:8] %>%
  left_join(select(spearmans_rho, variable, rp_rho = rho, rp_rho_stars), by = "variable")

corrdiff <- select(correlations, starts_with("rp_rho")) -
  select(correlations, rp_r, rp_stars)

correlations <- correlations %>% bind_cols(rename_with(corrdiff, ~ paste0(.x, "_diff")))
rm(corrdiff)

correlations <- correlations %>% mutate(rp_rho_dir_diff = (rp_rho > 0) - (rp_r > 0))

correlations %>%
  select(variable, rp_r, rp_stars, starts_with("rp_rho")) %>%
  kable(col.names = c("Variable", "R", "Stars", "Rho", "Stars", "Rho - R", "Stars", "Direction"),
        align = "c") %>%
  #column_spec(2:6, width_min = "5em") %>%
  kable_styling() %>%
  add_header_above(c(" " = 1, "Pearson's" = 2, "Spearman's" = 2, "Difference" = 3))
Compare reproduced Pearson’s R and Spearman’s Rho

| Variable | Pearson’s R | Pearson’s Stars | Spearman’s Rho | Spearman’s Stars | Rho - R | Stars Diff. | Direction Diff. |
|---|---|---|---|---|---|---|---|
| dis_pct | -0.060 | 2 | -0.113 | 2 | -0.053 | 0 | 0 |
| white_pct | -0.332 | 2 | -0.421 | 2 | -0.089 | 0 | 0 |
| black_pct | 0.460 | 2 | 0.575 | 2 | 0.115 | 0 | 0 |
| native_pct | 0.019 | 0 | -0.084 | 2 | -0.103 | 2 | -1 |
| asian_pct | 0.094 | 2 | 0.194 | 2 | 0.100 | 0 | 0 |
| other_pct | 0.026 | 0 | 0.104 | 2 | 0.078 | 2 | 0 |
| non_hisp_white_pct | -0.361 | 2 | -0.454 | 2 | -0.093 | 0 | 0 |
| hisp_pct | 0.119 | 2 | 0.231 | 2 | 0.112 | 0 | 0 |
| non_hisp_non_white_pct | 0.442 | 2 | 0.481 | 2 | 0.039 | 0 | 0 |
| bpov_pct | NA | NA | NA | NA | NA | NA | NA |
| apov_pct | NA | NA | NA | NA | NA | NA | NA |
| pct_5_17 | 0.084 | 2 | 0.079 | 2 | -0.005 | 0 | 0 |
| pct_18_34 | 0.063 | 2 | 0.034 | 1 | -0.029 | -1 | 0 |
| pct_35_64 | -0.008 | 0 | -0.020 | 0 | -0.012 | 0 | 0 |
| pct_65_74 | -0.091 | 2 | -0.151 | 2 | -0.060 | 0 | 0 |
| pct_75 | -0.186 | 2 | -0.285 | 2 | -0.099 | 0 | 0 |
| male_pct | -0.134 | 2 | -0.201 | 2 | -0.067 | 0 | 0 |
| female_pct | 0.023 | 0 | -0.014 | 0 | -0.037 | 0 | -1 |

Significance changes between the two tests do occur: native_pct and other_pct are reassigned to the ** significance level, pct_18_34 is bumped down to * significance, and the native_pct and (non-significant) female_pct coefficients flip sign. This correlation table, which still shows correlation between various disability and demographic categories and COVID-19 spread, is likely a more defensible version of the original study’s correlations.

Clustering Analysis

For the second section of the methods, I’ll attempt to create a map of clusters of counties with high COVID-19 incidence rates. This is somewhat of a reproduction of Chakraborty’s work, in that he used county clusters as an input to generalized estimating equations to account for spatial autocorrelation in his disability analysis.

To delineate his high-incidence clusters, Chakraborty used SaTScan, a spatio-temporal statistical software package developed for disease monitoring that uses spherical great-circle distance calculations from the centroid of each county to determine county cluster extents. Chakraborty then used the county clusters as a way to control for spatial variation in disability and COVID-19 rates, something that could also be done with more common geographical techniques like a spatial regression. While potentially effective in this case, the selection of SaTScan as the tool to account for spatial correlation is likely a product of Chakraborty’s epidemiology background, the discipline SaTScan is built for. To attempt to mimic his clustering results in R, I will use the Kulldorff scan in the SpatialEpi R package, a hopefully analogous spatial clustering approach.

To generate clusters, SpatialEpi uses county relative risk (comparing local incidence to global incidence rates) to cluster counties into regional areas of relatively higher risk, grouping counties with similarly high risk profiles together. The package returns the primary cluster of high-risk counties and its best estimate of any secondary clusters, if they exist. To assess the effectiveness of this method, I will map the clusters to visually determine whether they accurately model the perceived clustering of high-risk counties.
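As a reference point (my own summary of the standard Kulldorff statistic, not a formula quoted from Chakraborty): both SaTScan and SpatialEpi’s kulldorff() rank candidate circular zones $z$ under the Poisson model by the likelihood ratio below, where $c_z$ and $e_z$ are the observed and expected case counts in the zone, $C$ is the total case count, and only zones with $c_z > e_z$ qualify as high-risk clusters. Significance of the best-scoring zone is then assessed by Monte Carlo simulation.

$$
\Lambda(z) = \left(\frac{c_z}{e_z}\right)^{c_z} \left(\frac{C - c_z}{C - e_z}\right)^{C - c_z}
$$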

start_time <- Sys.time()
covid_geo <- covid_table %>%
  select(x, y) %>%
  latlong2grid()
# latlong2grid creates approximate equidistant cylindrical grid
# could probably reproject to epsg 5070 and create table with x and y

# calculate expected cases with one strata
expected.cases <- expected(covid_table$pop, covid_table$cases, 1)

# Kulldorff spatial scan statistic
covid_kulldorff <- kulldorff(
  geo = covid_geo,
  cases = covid_table$cases,
  population = covid_table$pop,
  expected.cases = expected.cases,
  pop.upper.bound = 0.5,
  n.simulations = 999,
  alpha.level = 0.05,
  plot = TRUE
)

print(
  paste(
    "Run time:",
    round(difftime(Sys.time(), start_time, units = "mins"), 2),
    "minutes"
  ),
  quote = FALSE
)


# save results in a file appended with the current date
saveRDS(covid_kulldorff,
  file = here("data", "derived", "public", paste0("covid_kulldorff_", Sys.Date(), ".RDS"))
)
## [1] Most likely cluster:
## $location.IDs.included
##  [1] 1824 1835 1797 1818 1825 1749 1854 1742 1837 1747 1838  280 1760 1846 1756
## 
## $population
## [1] 16949211
## 
## $number.of.cases
## [1] 469091
## 
## $expected.cases
## [1] 233805.6
## 
## $SMR
## [1] 2.006329
## 
## $log.likelihood.ratio
## [1] 97983.07
## 
## $monte.carlo.rank
## [1] 1
## 
## $p.value
## [1] 0.001
## [1] Number of Secondary clusters: 132

I’ll first match the FIPS code of each cluster county to the FIPS codes found in covid_table, the geometry-free data frame version of the JHU COVID data. From these matched clusters and FIPS codes I’ll be able to map the detected clusters qualitatively. This map should match Chakraborty’s clustering data that he input to his GEEs.

# list of primary cluster counties
primary_cluster_fips <- covid_kulldorff$most.likely.cluster$location.IDs.included

clusters <- data.frame(
    fips = covid_table[primary_cluster_fips, "fips"],
    clusterID = covid_table[primary_cluster_fips[1], "fips"],
    likelihood = covid_kulldorff$most.likely.cluster$log.likelihood.ratio,
    stringsAsFactors = FALSE
  )
# Get a list of secondary clusters
secondary <- covid_kulldorff$secondary.clusters

# similarly add counties in each secondary cluster to the list of clusters
for (i in secondary) {
  cluster_locations <- i$location.IDs.included
  new_clusters <- data.frame(
    fips = covid_table[cluster_locations, "fips"],
    clusterID = covid_table[cluster_locations[1], "fips"],
    likelihood = i$log.likelihood.ratio,
    stringsAsFactors = FALSE
  )
  clusters <- rbind(clusters, new_clusters)
}

Joining the cluster data back to the acs_covid dataframe

acs_covid <- acs_covid %>%
  left_join(clusters, by = "fips") 

acs_covid<-acs_covid|>
  mutate(isCluster = case_when(
    clusterID == fips ~ "center of cluster",
    !is.na(clusterID) ~ "other part of cluster",
    .default = NA
  ))

Planned deviation for reproduction: Map the SpatialEpi cluster results.

acs_covid <- st_as_sf(acs_covid)

tm_spatialepi_clusters <-
  tm_shape(acs_covid) +
  tm_polygons(col = "isCluster",
          palette = "-Oranges",
          popup.vars = c("fips", "clusterID"),
          colorNA = "grey",
          title = "SpatialEpi Kulldorff COVID-19 Clusters",
          border.col = "white",
          lwd = 0.2,
          border.alpha = 0.2) +
  tm_shape(state) +
  tm_borders("grey", lwd = .5) +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  )

tm_spatialepi_clusters

This map shows our extracted COVID-19 clusters, with the center county of each cluster in red and the other counties within the cluster shown in cream. On first pass, we see a huge cluster in the southeast centered on the Florida Panhandle, spreading through much of the South. We also see several single-county clusters, especially in the more sparsely populated Midwest and West. These are likely isolated because the surrounding regions lack the population density to spread COVID-19 effectively.

I’ll now calculate the relative risk, a product of Chakraborty’s SaTScan analysis but not of the SpatialEpi R package. This will need to be grouped by cluster to match the final SaTScan output.
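Concretely, with $c$ and $p$ the summed cases and population of a cluster (or of a single county, for the local version) and $C$ and $P$ the study-area totals, the relative risk computed in the next chunk is:

$$
RR = \frac{c/p}{(C - c)/(P - p)}
$$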

total_pop <- sum(acs_covid$pop)
total_cases <- sum(acs_covid$cases)

acs_covid <- acs_covid %>%
  group_by(clusterID) %>%
  mutate(
    rr_cluster = ifelse(is.na(clusterID), NA,
      (sum(cases) / sum(pop)) / ((total_cases - sum(cases)) / (total_pop - sum(pop)))
    )
  ) %>%
  ungroup() %>%
  mutate(
    rr_loc = (cases / pop) / ((total_cases - cases) / (total_pop - pop))
  )

Relative risk will be classified on a scale from 1 to 6.

| Relative Risk Value | Relative Risk Class |
|---|---|
| Outside of cluster | 1 |
| RR < 1 | 1 |
| 1 <= RR < 2 | 2 |
| 2 <= RR < 3 | 3 |
| 3 <= RR < 4 | 4 |
| 4 <= RR < 5 | 5 |
| RR >= 5 | 6 |

Counties falling outside of any cluster are assigned a score of 1.

# class breaks
breaks <- c(-Inf, 1, 2, 3, 4, 5, Inf)

acs_covid <- acs_covid %>%
  mutate(
    cluster_class = ifelse(is.na(clusterID), 1, cut(rr_cluster, breaks, labels = FALSE)),
    loc_class = cut(rr_loc, breaks, labels = FALSE)
  )

Map relative risk scores

Let’s visualize these relative risk scores spatially!

First, map the spatial distribution of local relative risk score classifications by county.

# count the frequency of counties in each class and create labels
class_freq <- acs_covid %>% st_drop_geometry() %>% count(loc_class)
class_freq$qual <- ifelse(class_freq$n > 1, " counties", " county")
class_freq[1, ]$qual <- paste(class_freq[1, ]$qual, "at low risk")
class_freq[nrow(class_freq), ]$qual <- paste(class_freq[nrow(class_freq), ]$qual, "at high risk")
class_freq$label <- paste0(class_freq$loc_class,
                           " (",
                           class_freq$n,
                           class_freq$qual,
                           ")")

# Map Local Relative Risk scores
tm_spatialepi_local_risk_class <- tm_shape(acs_covid) +
  tm_polygons("loc_class",
    title = "Local Relative Risk Class",
    border.col = "white",
    border.alpha = .2,
    lwd = 0.2,
    palette = "Oranges",
    style = "cat",
    labels = class_freq$label
  ) +
  tm_shape(state) +
  tm_borders("grey", lwd = .5) +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  )

tm_spatialepi_local_risk_class

Next, I map the cluster relative risk scores for comparison. Note that following the original study classification methodology, counties outside of clusters are assigned the lowest risk class of 1.

# count the frequency of counties in each class and create labels
class_freq <- acs_covid %>% st_drop_geometry() %>% count(cluster_class)
class_freq$qual <- ifelse(class_freq$n > 1, " counties", " county")
class_freq[1, ]$qual <- paste(class_freq[1, ]$qual, "at low risk")
class_freq[nrow(class_freq), ]$qual <- paste(class_freq[nrow(class_freq), ]$qual, "at high risk")
class_freq$label <- paste0(class_freq$cluster_class,
                           " (",
                           class_freq$n,
                           class_freq$qual,
                           ")")

# map cluster relative risk scores
tm_spatialepi_cluster_risk_class <- tm_shape(acs_covid) +
  tm_polygons("cluster_class",
    title = "Cluster Relative Risk Class",
    border.col = "white",
    border.alpha = .2,
    lwd = 0.2,
    palette = "Oranges",
    style = "cat",
    labels = class_freq$label
  ) +
  tm_shape(state) +
  tm_borders("grey", lwd = .5) +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  )

rm(class_freq)

tm_spatialepi_cluster_risk_class

We can see that the cluster relative risk does a decent job of describing the overall relative risk, with our large southeast cluster capturing much of the higher risk across the region. Isolated clusters in the Midwest also capture spatial differences in relative risk, with different risk levels seen in adjacent clusters.

To properly compare the SpatialEpi results to the SaTScan output, we should know how many clusters we have. I’m going to count them and identify the cluster with the largest number of included counties. The original study found 102 unique clusters with between 1 and 255 counties in each cluster.

# summarize clustering results
cluster_summary <- acs_covid %>%
  drop_na(clusterID) %>%
  dplyr::filter(cases > 0) %>%
  st_drop_geometry() %>%
  count(clusterID)
cat(
  length(cluster_summary$n),
  "unique clusters based on spatialEpi CLUSTER relative risk\n"
)
## 133 unique clusters based on spatialEpi CLUSTER relative risk
summary(cluster_summary$n)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   7.023   2.000 599.000
rm(cluster_summary)

We failed to reproduce the SaTScan output, producing 133 unique clusters with between 1 and 599 counties each, many more than the original output. This is an interesting result, potentially pointing to our implementation of the Kulldorff scan statistic differing from SaTScan’s. One hypothesis is that it clusters more sensitively at lower levels of risk, as seen in the southeast megacluster (clusterID 12005), centered on Bay County, Florida. At the same time as we see more preferential grouping into clusters, we also see the sheer number of clusters increase. This suggests that SpatialEpi is, in addition to being more willing to group counties together, happy to initialize more clusters.

Discussion

Because this study seeks both to A) reproduce the first part of Chakraborty’s work, assessing it for validity, and B) use our own methodology to recreate the clustering he performed in an external program, the analysis will be split into two sections. The first will explore the correlation between county-level disability rates and incidence: if there is a relationship between the disability subcategories and COVID-19 incidence like the one Chakraborty found, we should return significant correlation coefficients. The second section will compare our finished clusters to Chakraborty’s before considering why any differences might occur, given that we’re using a different program to generate our clusters than he did. I’m not expecting the clusters to turn out exactly the same, so it’ll be interesting to interrogate why any differences arise.

Disability Correlation

Agreement was found between our mapped disability data and Chakraborty’s Figure 1, producing results without discernible differences. Summary statistics told a relatively similar story, with slight differences, likely from rounding, that reflected minor differences in methodology. Specifically, it appeared that Chakraborty rounded COVID-19 cases to the nearest whole person, a step I skipped in order to stay true to the data.

After performing Pearson’s correlation on the variables in the data, following Chakraborty’s methods, I subtracted the resulting significance values from those presented in the corresponding table of the original paper, which was loaded for comparison. The results are mostly identical, with similar significance thresholds but slightly different actual values, which could be attributable to the difference in rounding methodology. The only variable that changed significance threshold was pct_18_34, which moved from significance level 1 to level 2.

I also tested the correlations with a bivariate nonparametric test (Spearman’s rho) to assess the validity of Chakraborty’s parametric test, which was applied even though the dataset wasn’t normally distributed. Several significance values changed: native_pct and other_pct gained significance, and pct_18_34 was reassigned to a lower (one-star) significance level.

All in all, this section of the reproduction was a huge success. Despite the rounding issues, we were, for the most part, able to reproduce Chakraborty’s results, fulfilling our first goal of reproducing his workflow in open-source code. We were also able to add to the analysis by including a test for normality and, given the non-normal distribution of the data, performing a non-parametric correlation test. This test produced somewhat different results than Chakraborty’s original analysis, deepening the degree of analysis.

Clustering

Implementation of the Kulldorff spatial scan statistic in the SpatialEpi package instead of SaTScan was less successful. Clusters were successfully extracted from the data, but they did not match the size and number of clusters in the original manuscript. We found more clusters (133 vs. 102) and a larger maximum cluster (599 vs. 255 counties), calling into question the applicability of this package for reproducing the clustering operation of SaTScan. That said, the clusters produced do appear sensible given the underlying data, so the mismatch is likely reflective of a difference between the software implementations rather than an incorrect application of the package.

Integrity Statement

This is the only preregistration for this study.

Acknowledgements

This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI:[10.17605/OSF.IO/W29MQ](https://doi.org/10.17605/OSF.IO/W29MQ)

References

Chakraborty, J. 2021. Social inequities in the distribution of COVID-19: An intra-categorical analysis of people with disabilities in the U.S. Disability and Health Journal 14:1-5. DOI:[10.1016/j.dhjo.2020.101007](https://doi.org/10.1016/j.dhjo.2020.101007)

Grosjean, Philippe, and Frederic Ibanez. 2024. Pastecs: Package for Analysis of Space-Time Ecological Series. https://github.com/SciViews/pastecs.
Kim, Albert Y., Jon Wakefield, and Mikael Moise. 2023. SpatialEpi: Methods and Data for Spatial Epidemiology. https://github.com/rudeboybert/SpatialEpi.
Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://here.r-lib.org/.
Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
———. 2024. Sf: Simple Features for r. https://r-spatial.github.io/sf/.
Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With applications in R. Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Tennekes, Martijn. 2018. “tmap: Thematic Maps in R.” Journal of Statistical Software 84 (6): 1–39. https://doi.org/10.18637/jss.v084.i06.
———. 2025. Tmap: Thematic Maps. https://github.com/r-tmap/tmap.
Walker, Kyle, and Matt Herman. 2025. Tidycensus: Load US Census Boundary and Attribute Data as Tidyverse and Sf-Ready Data Frames. https://walker-data.com/tidycensus/.
Wickham, Hadley. 2023. Tidyverse: Easily Install and Load the Tidyverse. https://tidyverse.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Zhu, Hao. 2024. kableExtra: Construct Complex Table with Kable and Pipe Syntax. http://haozhu233.github.io/kableExtra/.